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Chapter 1 

Introduction for "An Introduction to 
Wavelet Analysis" 1 



In this chapter, we give an overview on multiresolution analysis, wavelet series and wavelet estimators in 
the classical setting. By 'classical' or 'first-generation' wavelets, we mean wavelets that were constructed 
initially to analyze signals observed at equispaced design points and having a sample size which is a power 
of two. The 'second-generation' wavelet basis presented in the subsequent chapters will release these two 
constraints. 

If one wants to analyze a function of time with a series expansion, the first idea that comes probably into 
one's mind is to use a Fourier series, i.e. decompose the function into sine and cosine at different frequencies. 
In this process, we hope that only a few coefficients in the series will carry most of the information about the 
signal. Certain smooth functions have such an 'economical' Fourier expansion. However, for most functions, 
a good Fourier series approximation requires numerous sine and cosine basis functions. Indeed, the sine 
functions have a precise frequency but are not localized in time, hence a localized information in the signal 
like a discontinuity will affect all the coefficients of the series. This drawback lead the researchers to look 
for more efficient bases, that is, bases which are localized both in time and in frequency. We will see in 
Multiresolution analysis and wavelets (Chapter 3) that a wavelet basis offers this property. 

This chapter is structured as follows. We begin by recalling some notations in Function spaces: notion and 
notations (Chapter 2). Next Multiresolution analysis and wavelets (Chapter 3) introduces the multiresolution 
analysis, the wavelet functions, and gives some simple examples of wavelet bases. Fast wavelet transform 
(Chapter 4) explains how to decompose a signal using the wavelet transform. Such wavelet transforms, also 
called 'decimated', lack the property of translation-invariance. Non-decimated wavelet transform (Chapter 5) 
presents a widely used trick to make a wavelet transform translation- invariant. Since the main goal of a 
wavelet series is to provide a good approximation of a function belonging to a given space, Approximation of 
Functions (Chapter 6) introduces some fundamental notions to measure the quality of such approximation. 
Finally, Nonparametric regression with wavelets (Chapter 7) presents how to construct a nonparametric 
regression estimator using wavelets. First, the classical case of equally spaced design is considered. In the 
last part of Nonparametric regression with wavelets (Chapter 7), we review some existing methods that deal 
with irregular designs. 
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CHAPTER 1. INTRODUCTION FOR "AN INTRODUCTION TO WAVELET 

ANALYSIS" 



Chapter 2 

Function spaces: notion and notations 1 



A Hilbert space is a complete normed space whose norm is indexed by an inner (or scalar) product. 

Two disjoint subspaces A and B of a space S form a direct sum decomposition of S if every element of S 
can be written uniquely as a sum of an element of A and an element of B. The notation S = A © B is then 
used. 

A measurable function / belongs to the Lebesgue space L p (R) , 1 < p < oo if 

+00 \ Vp 

\f(x)\ p ) <oo. (2.1) 

■00 / 

An example of a Hilbert space is the Lebesgue space L2 (R) of measurable and square integrable functions. 
Indeed, the norm || ■ || 2 is induced by the scalar product 



<f,9>= / f(x)g(x)dx, (2.2) 



where g (x) denotes the complex conjugate of g(x). Two functions are said to be orthogonal in Li (K) if 
their inner product is zero. 

The Lebesgue measure can be replaced by a more general measure /i, leading to the weighted space 
L2 (/z), which has as inner product 



<f,g> d „=f(x)g(x)du(x) (2.3) 



and which contains the functions that have a finite norm ||/|| d „ := y/< f , / >^„ < 00. 

A countable subset {fk} of functions belonging to a Hilbert space is a Riesz basis if every element / of 
the space can be written uniquely as / = ^ k Ckfk, and if positive constants A and B exist such that 

A\\f\\l<Y J \^\ 2 < B Wf\\l- ( 2 - 4 ) 

k 

A Riesz basis is an orthogonal basis if the fk are mutually orthogonal. In this case, A = B = 1. 
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CHAPTER 2. FUNCTION SPACES: NOTION AND NOTATIONS 



Chapter 3 

Multiresolution analysis and wavelets 1 



3.1 Definition of subspaces and of scaling functions 

A natural way to introduce wavelets is through the multiresolution analysis. Given a function / s L 2 (K), a 
multiresolution of L 2 (R) will provide us with a sequence of spaces Vj, Vj+i, ... such that the projections of 
/ onto these spaces give finer and finer approximations (as j — » oo) of the function /. 

Definition: (Multiresolution analysis (MRA) in the first generation) A multiresolution anal- 
ysis of L 2 (R) is defined as a sequence of closed subspaces Vj C L 2 (R) , j € Z with the following 
properties: 

1. 

... C V_i C V C Vi. C ... (3.1) 

2. The spaces Vj satisfy 

(J Vj is dense in L 2 (R) and f] Vj = {0} . (3.2) 

3. If / (a;) € Vb, / (2 J x) e Vj, i.e. the spaces Vj are scaled versions of the central space Vq. 

4. If / € Vb, / (. — k) € Vb, fc € Z, that is, Vb (and hence all the Vj) is invariant under translation. 

5. There exists <p g Vb such that {ip (x — k) ;k e Z} is a Riesz basis in Vb- 

We will call 'level' of a MRA one of the subspaces Vj. From Definition, p. 5, it follows that, for fixed j, the 
set {ifjk (x) = 2^ 2 tf {Vx — k) ; k € Z} of scaled and translated versions of <p is a Riesz basis for Vj. Since 
</C € Vb C Vi, we can express ^asa linear combination of {yi^}: 

V( a; ) = ^2 h k <p hk (x) = V2^ h k ip (2x - k) . (3.3) 

feez fcez 

(3.3) is called the two-scale equation or refinement equation. It is a fundamental equation in MRA 
since it tells us how to go from a fine levelVi to a coarser levelVb- The function ip is called the scaling 
function. 

As said before, the spaces Vj will be used to approximate general functions. This will be done by defining 
appropriate projections onto these spaces. Since the union of all the Vj is dense in L 2 (R) , we are guaranteed 
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6 CHAPTER 3. MULTIRESOLUTION ANALYSIS AND WAVELETS 

that any given function of Li (R) can be approximated arbitrarily close by such projections. As an example, 
define the space Vj as 



Vj = {feL 2 (R);VkeZ,f\ 



2-J'fc,2-->'(fc+l) Inconstant} (3.4) 



Then the scaling function ip (x) = l[o,i) (%), called the Haar scaling function, generates by translation and 
dilatation a MRA for the sequence of spaces {Vj,j € Z} defined in (3.4), see [36], [16]. 

3.2 The detail space and the wavelet function 

Rather than considering all the nested spaces Vj, it would be more efficient to code only the information 
needed to go from Vj to Vj + \. Hence we consider the space Wj which complements Vj in Vj + i : 

Vj+i = V, © Wj . (3.5) 

The space Wj is not necessarily orthogonal to Vj, but it always contains the detail information needed to 
go from an approximation at resolution j to an approximation at resolution j + 1. Consequently, by using 
recursively the equation (3.5), we have for any j e Z, the decomposition 



L 2 (R) = V jo ®®? =j Wj. (3.6) 

With the notational convention that Wj -\ := Vj a , we call the sequence 
{H^} > _! a multiscale decomposition (MSD). 

We call %jj a wavelet function whenever the set {tp (x — k) ; k s Z} is a Riesz basis of Wo- Since Wo C V\, 
there also exist a refinement equation for ip, similarly to (3.3): 

4> (x) = V2^g k y {2x - k) . (3.7) 

fe 

The collection of wavelet functions {ipjk = 2^ 2 ip {Vx — k) ; k s Z, j s Z} is then a Riesz basis for Li (R). 
One of the main features of the wavelet functions is that they possess a certain number of vanishing moments. 

Definition: A wavelet function ip(x) has TVvanishing moments if Jtp(x)x p dx = 0, p = 
0,...,N-1. 

We now mention two interesting cases of wavelet bases. 

3.3 Orthogonal bases 

In an orthogonal mult iresolut ion analysis, the spaces Wj are defined as the orthogonal complement of 
Vj in Vj+i- The following theorem tells us one of the main advantages of such a MRA. 

Theorem: ([16], Theorem 5.1.1) If a sequence of closed subspaces (Vj) ._ z in L^ (R) satisfies 
Definition, p. 5, and if, in addition, {ip (x — k) ,k s Z} is an orthogonal basis for Vq, then there 
exists one function ip (x) such that {tp (x — k) ; k e Z} forms an orthogonal basis for the orthogonal 
complement Wo of Vb in V\. 

An immediate consequence of Theorem, p. 6 is that {ipjk,k e Z} constitutes an orthogonal basis for the 
orthogonal complement Wj of Vj in Vj+\. In this section, let Vj (resp. Qj) be the orthogonal projection 
operator onto Vj (resp. Wj). The orthogonal expansion 

/ = rjj + ET^Qjf m 

= E fe < / . Wo* > <pj ,k + Y%L jo E fe < / . 1>jk > ipjk 



tells us that a first, coarse approximation of / in Vj is further refined with the projection of / onto the 
detail spaces Wj. 

Figure 3.1 shows two examples of orthogonal wavelet functions. The first is the Haar wavelet, associated 
to the Haar scaling function defined in "Definition of subspaces V j and of scaling functions" (Section 3.1: 
Definition of subspaces and of scaling functions) . 

V(z) Haar = 2- 1 / 2 (^ Haar (2* - 1) - ^ Haar (2s)) = l [hl) (x) - l [oi) (a:) . (3.9) 

The Haar wavelet has only one vanishing moment and consequently is optimal only to represent functions 
having a low degree of regularity, like, for example, (3— Holder functions with < (3 < 1. 

Daubechies constructed in [15], [16] compactly supported wavelets which have more than one vanishing 
moment. Compactly supported wavelets are desirable from a numerical point of view, while having more 
than one vanishing moment allows to reconstruct exactly polynomials of higher order. These wavelets cannot, 
in general, be written in a closed analytic form. However, their graph can be computed with arbitrarily high 
precision using a subdivision scheme algorithm. Figure 3.1(b) represents the Daubechies Least Asymmetric 
wavelet with N = 4 vanishing moments. 
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(a) 




(b) 

Figure 3.1: Some orthogonal basis functions: (a) the Haar wavelet function bases with N — 1 vanishing 
moments, (b) the Least Asymmetric wavelet function of Daubechies [16], [15], with N — 4 vanishing 
moments, (a) (a) N = 1 (b) (b)JV = 4 



This figure also illustrates the reason behind the name 'wavelet': since wavelets are functions with a 
certain number of vanishing moments, they have the shape of a 'little wave' or 'wavelet'. 



3.4 Biorthogonal bases 

Having an orthogonal MRA puts strong constraints on the construction of a wavelet basis. For example, the 
Haar wavelet is the only real- valued function which is compactly supported and symmetric. However, if we 
relax orthogonality for biorthogonality, then it becomes possible to have real- valued wavelet bases of fixed 
but arbitrary high order (see Definition 1 from Approximation of Functions (Definition, p. 17)) which are 
symmetric and compactly supported [13]. In a biorthogonal setting, a dual scaling function <p and a dual 
wavelet function ip exist. They generate a dual MRA with subspaces Vj and complement spaces Wj such 



9 

that 

Vj ± Wj and V,- ± Wj . (3.10) 

In other words, 

<<p,ip(--k)>=0 and < tp,i>(- -k) >= (3.11) 

Moreover, the dual functions also have to satisfy 

< <p,<p{-- k) >=6 k ,o and < ip , ip (• - k) >= 5 k ,o , (3.12) 

where <5fc r o ls the Kronecker symbol. By construction, the dual scaling and wavelet functions satisfy a 
refinement equation, similarly to the equations (3.3) and (3.7). 

In this work, we use the following convention: the dual MSD will be used to decompose a function (or a 
signal), while the original, or primal MSD reconstructs the function. This yields the following representation 
of a function / G L^ (R) 

DC 

k j=jo k 

Figure 3.2 shows an example of a biorthogonal wavelet basis built by Cohen, Daubechies and Feauveau in 
[13], (called CDF- wavelets hereafter). 
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Figure 3.2: Primal and dual scaling and wavelet functions for the (3,l)-Cohen-Daubechies-Feauveau 
(CDF) biorthogonal basis. The primal wavelet function V nas one vanishing moment while the dual 
wavelet if) has three vanishing moments. 



Chapter 4 

Fast wavelet transform 



4.1 One-dimensional wavelet transform 

Suppose we are given as signal the projection of a function onto the space Vj + \: 

V 3 + lf = X] Sj+l,k<Pj+l,k (x) , S j+lt k =< f , <fij+l,k > • (4-1) 

k 
Using the dual refinement equations, we have: 

Sj,k=< f ,<Pj,k> = < f ,12lhlipj+l,2k+l > , s 

where the coefficients Sjk are called scaling coefficients, since they are related to scaling functions. 
Similarly, the wavelet or detail coefficientsdj^ are obtained as 

djk =< f , ipjk >= 5Z 9l-2kSj+i,l ■ (4-3) 

k 

The coefficients Sjk and djk are obtained from Sj+i j by 'moving average' schemes, using the filter coefficients 
{hi} and {g{\ as 'weights', with the exception that these moving averages are sampled only at the even 
integers, i.e. a downsampling is performed. Such transform allows, once we have computed s,j^=< f , <fij,k > 
for a fine level J e N, to compute Sjk and djk for all coarser levels j < J without evaluating the integrals. 

Suppose now we are given the values of / at n = 2 J equispaced design points. The scaling functions 
V?j.fc: k = 0, ..., 2 J — 1, are compactly supported and localized around 2~ J k. Hence the coefficients < / ,<pj.k > 
are weighted and scaled average of / on a neighborhood of 2~ J k which becomes smaller as J tends to infinity. 
Consequently, it makes sense to replace the integral < f ,(pj t k > by the (scaled) value of / at 2~ J k. More 
complicate quadrature formulae have been developed in [44], [45], [46]. 

With Sj := {sjk; k = 0, ..., 2 J — 1} and dj := {djk', k = 0, ..., 2 J — 1}, the forward (or analyzing) wavelet 
transform given by (4.2)-(4.3) can be rewritten as 

Sj = HjSj+i and dj = G*Sj + i , (4.4) 

where H* denotes the Hermitian conjugate of Hj. 
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CHAPTER 4. FAST WAVELET TRANSFORM 



The inverse (or synthesis) transform is found by using the primal refinement equations and the fact that 

V J+1 = V j( BW 3 . 



"Pj+lf = J2l Sj+l,l<Pj+l,l = J2k s j,kfj,k + J2k dj,k1pj,k 

= J2k s jm J2i hupj+iak+i + J2k d i,k J2i gi<Pj+i,2k+i 

= T,i tPj+1,1 (Sfe hi-2kSj,k + J2k 9i-2kdjk) , 

from which it follows that 

s j + l,l = /_^ hl-2kSjk + 2_^9l-2kdjk ■ 

k k 

In matrix form, we have 



(4.5) 



(4.6) 



Sj +1 = HjSj + Gjdj . (4-7) 

In the finite and classical setting, the matrices Hj, Gj, Hj and Gj are of size 2 J+1 x V . Moreover, if the 
basis functions are compactly supported, the four filters (hi, gi, hi, gi) have only a finite number of nonzero 
elements, and hence all these matrices are banded. 



4.1.1 Example: Haar wavelet transform 

In case of the orthogonal Haar transform, H* = H* and is of the form 

h hi 



H* 



h hi 



h hi 



(4i 



since only ho and hi are different from zero : ho = hi = 1/V2- The high-pass filter {g{\ is such that 
go = — l/y/2 and gi = \j\pl. The forward transform (4.2)-(4.3) reduces to 



Sj,k - -7sSj+l t 2k+l + ~m «j+l,2fc 



V2 



V2- 



d 



■j,k - ^«j+l,2fe+l - ^Sj+lflk , 



(4.9) 



and the reconstruction is given by 



Sj + l,2k 
s j + l,2k+l 



1 - l_ rl 

^2 b J,k ^2 a 3,k 

V2 S ^ k + V2 d J> k ■ 



(4.10) 



4.2 Two-dimensional wavelet transform 

The wavelet transform has been successfully applied to compress images, which are modelled as functions 
defined on a regular two-dimensional grid. 
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Figure 4.1: Two-dimensional wavelet transform: first the filters are applied on the column of the 
matrix Sj, this produces two matrices. The filters are applied a second time on the columns of these two 
matrices, resulting in four elements: a matrix of scaling coefficients, and three detail matrices. 



The easiest way to build a two-dimensional MRA is probably to use tensor products of spaces, see [17], 
[37]. In terms of wavelet transforms, this leads to applying two times a one-dimensional transform: first on 
the 'row' of the signal matrix Sj, and second on the 'columns' of the resulting two matrices, see Figure 4.1. 
In this figure, we see that, at each level of the decomposition, three types of detail coefficients are produced: 
Dj, D v - and D d ,. These superscripts recall that, in an image, horizontal edges will lead to large values of 
Dj, vertical edges will show up in W" and D^ will be sensitive to diagonal lines. 

However, such a transform is not able to compress efficiently an image that contains curves. More complex 
bidimensional bases are now proposed in the literature to better model discontinuities along curves, see for 
example [12], [31], [41], [48]. 
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Chapter 5 

Non-decimated wavelet transform 



Suppose we have some signal {tji} observed at some equispaced design points : y, = / (i/n) , i = 1, ..., n with 
n = 2 J , J g N. The transform presented in the previous section is sometimes called 'decimated' because, for 
each scale j, the coefficients djk give only some information about the signal near the positions x = 2~^k, 
and not near all the existing design points 2~ J k = k/n. 

For this reason, the decimated wavelet transform lacks the property of translation invariance: given 
to € R, the wavelet decomposition of / (.) and of / (. — to) are in general completely different. This may lead 
to some unwanted pseudo-Gibbs oscillations near a discontinuity which is not localized at a dyadic point 
[14]. 

One remedy to this drawback consists in using a non-decimated wavelet transform (NDWT) , also called 
translation- invariant (TI) [14] or stationary [40]. The idea behind the NDWT is to a perform a discrete 
wavelet transform, not only of the original sequence {j/i}™ =1 , but of all the possible shifted sequences (Shy) t = 
yit+h) mod n- In terms of wavelet functions, this transform corresponds to a set of functions 

& fc (x) = # (2* (a; - 2- J k)) , j = jo, ..., J- 1, k = 0, ...,2 J - 1 . (5.1) 

At a given scale j, the NDWT coefficients are thus present at all the locations k/n for k = 1, ...,n and give 
information about the signal at each observed design point. In other words, the non-decimated transform 
fills in the gap introduced in the decimated transform, see Figure 5.1. 
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CHAPTER 5. NON-DECIMATED WAVELET TRANSFORM 




o 



o o o 



o o o 



Figure 5.1: Schema illustrating the translation-invariant version of the Haar transform. The points 
marked by • are the one computed for the decimated Haar transform. At level J, one circulant shift 
is performed: the first observation is put at the end of the observed signal, and a second decimated 
transform is performed on the shifted data (yielding the points marked by o at level J — 1). This process 
is iterated at the coarser levels, producing detail coefficients at all the points. 



Since we have J — j scales and at each scales n detail coefficients, the NDWT gives an overdetermined 
representation of the original signal {yi}™ =1 and the wavelet coefficients {djk,j = 0, ..., J — 1, k = 1, ..., n} 
are related to many different bases. Therefore the inverse operator will not be unique. A particular inverse, 
the average basis, corresponds to systematically average out the inverse wavelet transform obtained from 
each decimated wavelet transform that constitutes the translation-invariant transform. This makes the 
reconstruction robust with respect to a bad choice of a particular basis. Moreover, this average basis provides 
a smoother reconstruction than the original, decimated, transform [14], [6]. 

It allows for a (nearly) exact reconstruction of piecewise linear functions, instead of piecewise constant 
functions for the decimated Haar transform [14]. 



Chapter 6 

Approximation of Functions 1 



We first give a definition of the order of a multiresolution analysis. 

Definition: (Order of a MRA in the classical setting) A multiresolution analysis is said to be of 
order TV if the primal scaling function ip reproduces polynomials up to degree TV — 1, i.e., For < 
p < TV, 3ck € K such that x p = J2k Cki P i x ~ k) ■ 

The associated dual wavelet ip has then TV vanishing moments. In the classical setting, it is proved that the 
order of a MRA and the regularity of the scaling function are linked: the larger TV, the higher the regularity 
of ip. Symmetrically to Definition, p. 17, the order of the dual MRA is TV if <p can reproduce polynomials 
up to degree TV — 1. Figure 2 from Multiresolution analysis and wavelets (Figure 3.2) shows an example of 
a biorthogonal basis where TV = 3 and TV = 1. It illustrates the link between a high number of vanishing 
moments of the dual wavelet ip and the regularity of the primal scaling function ip. 

The main objective when decomposing a function in a wavelet series is to create a sparse representation 
of the function, that is, to obtain a decomposition where only a few number of detail coefficients are 'large', 
while the majority of the coefficients are close to zero. By 'large', we mean that the absolute value of the 
detail coefficient is large. 

Near a singularity, large detail coefficients at different levels will be needed to recover the discontinuity. 
However, between points of singularity, we can hope to have small detail coefficients, in particular if the 
analyzing wavelets ipjk have a large number TV of vanishing moments. Indeed, suppose the function / to be 
decomposed is analytic on the interval / without discontinuity. Since < x p ,ipjk >= for p = 0, ...,TV — 1, 
we are sure that the first TV terms of a Taylor expansion of / will not give a contribution to the wavelet 
coefficient < / , ipjk > provided that the support of ipjk does not contain any singularities of the function /. 

This sparse representation explains why classical wavelets provide smoothness characterization of function 
spaces like the Holder and Sobolev spaces [18], but also of more general Besov spaces, which may contain 
functions of inhomogeneous regularity [27], [22], [21], [19], [20]. 

We illustrate this characterization property with the case of (3— Holder functions. 
Definition 2 
The class A" (L) of Holder continuous functions is defined as follows: 

1. if (3 <1, A? (L) = {f:\f (x)-f(y)\< L\x-yf} 

2. if/3> 1,A /3 (L) = {/: |/ (L/3J) (a;)-/(L/ 3 J)(y)| < L\x - yf ;\f m) {x)\ < M}, where [[3\ is the largest 
integer less than (3 and = (3 — \J}\ . 

The global Holder regularity of a function can be characterized as follows [10], [18]. 

Theorem: Let / e A' 3 (L), and suppose that the (orthogonal) wavelet function ip has r continuous 
derivatives and r vanishing moments with r > (3. Then 
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\< f ,^k >\ < C2^^ +1 ^ . (6.1) 

A similar characterization exists for continuous and Sobolev functions [18], [27]. 

In the orthogonal setting, the wavelet ip must be regular and have a high number of vanishing moments. 
On the contrary, in the biorthogonal expansion equation 5 from Multiresolution analysis and wavelets (3.13), 
it is mostly of interest to have a dual wavelet ip with a high number of vanishing moments, and hence a regular 
primal scaling and wavelet functions. On the primal side, it is sufficient to have only one vanishing moment 
for wavelet denoising, and consequently ip may not be very regular. In this case, the wavelet coefficient 
< / > i'jk > with the less regular wavelet tpjk can be used to characterize /eA* 5 (L) with < (3 < N, even 
if /3 > N = 1: with a biorthogonal basis, regular functions can be characterized by their inner products with 
much less regular functions. 



Chapter 7 

Nonparametric regression with wavelets 1 



In this section, we consider only real- valued wavelet functions that form an orthogonal basis, hence <p = (p and 
%j) = tp. We saw in Orthogonal Bases from Multiresolution analysis and wavelets (Section 3.3: Orthogonal 
bases) how a given function belonging to Li (R) could be represented as a wavelet series. Here, we explain 
how to use a wavelet basis to construct a nonparametric estimator for the regression function m in the model 

Yi = m(xi) + Si, i = l,...,n, n = 2 J , J e N , (7.1) 

where Xi = - are equispaced design points and the errors are i.i.d. Gaussian, £, ~ N (0,<r^). 

A wavelet estimator can be linear or nonlinear. The linear wavelet estimator proceeds by projecting 
the data onto a coarse level space. This estimator is of a kernel-type, see "Linear smoothing with wavelets" 
(Section 7.1: Linear smoothing with wavelets). Another possibility for estimating m is to detect which detail 
coefficients convey the important information about the function m and to put equal to zero all the other 
coefficients. This yields a nonlinear wavelet estimator as described in "Nonlinear smoothing with wavelets" 
(Section 7.2: Nonlinear smoothing with wavelets). 

7.1 Linear smoothing with wavelets 

Suppose we are given data (xi, 5^)" =1 coming from the model (7.1) and an orthogonal wavelet basis generated 
by {ip,ijj}. The linear wavelet estimator proceeds by choosing a cutting level j\ and represents an estimation 
of the projection of m onto the space Vj 1 : 

2 J 0-l^ ji-12»-l~ 

m{x) = Y^ s Jo,k<Pj ,k ( X )+J2Y1 d i^i- k ( x ) = XI s h,k ( Ph,k (x) , (7.2) 

fc=0 j=j k=0 k 

with jo the coarsest level in the decomposition, and where the so-called empirical coefficients are computed 

as 

s ^ k = ~ X] Yi Vi k ( x ») and d J' k = ~ X Yi ^i k ( Xi ^ ■ ( 7 - 3 ) 

The cutting level j\ plays the role of a smoothing parameter: a small value of j\ means that many detail 
coefficients are left out, and this may lead to oversmoothing. On the other hand, if j\ is too large, too many 
coefficients will be kept, and some artificial bumps will probably remain in the estimation of m(x). 



1 This content is available online at <http://cnx.Org/content/ml7396/l.3/>. 
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To see that the estimator (7.2) is of a kernel-type, consider first the projection of m onto Vj 1 : 

V Vj m{x) = Y, k {f m (y) l Pji,k(y)dy)(p ju k(x) , ?4 . 

= jK n {x,y)m{y)dy , 

where the (convolution) kernel Kj 1 (x,y) is given by 

K h ( x ,y) = ^vjuk (y) <Ph,k (x) . (7.5) 

k 

Hardle et al. [28] studied the approximation properties of this projection operator. In order to estimate 
(7.4), Antoniadis et al. [4] proposed to take: 

V Vji m (x) = £?=i Y, / ( V" 1)/n K h (x, y) dy ^ ^ 

= E fc E?=i Y i (f(!-i)/ n Viuk (y) dy) <p jl<k (x) . 

Approximating the last integral by —<Pj lt k {x%), we find back the estimator m (x) in (7.2). 

By orthogonality of the wavelet transform and Parseval's equality, the L^— risk (or integrated mean square 
error IMSE) of a linear wavelet estimator is equal to the l 2 — risk of its wavelet coefficients: 



IMSE = E\\ m -m|||, = £ fe £ 



S jo,k Sj 0> k 



^3=30 ^k 



djk ~ d° k 



2 



(7.7) 



+ ^7=31 Efc d °3k 2 - Si + S 2 + S 3 , 

where 



s° k :=< m,ipjk > and d° k =< m,ipjk > (7.8) 

are called 'theoretical' coefficients in the regression context. The term Si + S 2 in (7.7) constitutes the 
stochastic bias whereas 63 is the deterministic bias. The optimal cutting level is such that these two 
bias are of the same order. If m is j3— Holder continuous, it is easy to see that the optimal cutting level is 
ji (n) = O (n 1 /( 1+2 ^ 3 )). The resulting optimal IMSE is of order n~ 2 p+ 1 . In practice, cross-validation methods 
are often used to determine the optimal level j\ [4], [39]. 

7.2 Nonlinear smoothing with wavelets 

7.2,1 Hard-, soft-thresholding and wavelet estimator 

Given the regression model (7.1), we can decompose the empirical detail coefficient djk m (7-3) as 

djk = lYl=i m { x i)^3k{xi)+ iElLi^fe(^) ( 7 _ 9 ) 

= djk + Pjk 

If the function m (x) allows for a sparse wavelet representation, only a few number of detail coefficients 

djk contribute to the signal and are non-negligible. However, every empirical coefficient djk nas a non-zero 
contribution coming from the noise part pj k - 

Remark: Note the link between the coefficients djk in (7.9) and the theoretical coefficients d° k 
in (7.8): 



ijk = iY^=l m { x i)^3,k{Xi) 

= J m (x)^ k (x)dx + 0(±)=d° k + 0(±) 
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(7.10) 



In words, djk constitutes a first order approximation (using the trapezoidal rule) of the integral d° k . For 
the scaling coefficients s° k , it can be proved [47] that the order of accuracy of the trapezoidal rule is equal 
to N — 1, where N is the order of the MRA associated to the scaling function. 

Suppose the noise level is not too high, so that the signal can be distinguished from the noise. Then, from 
the sparsity property of the wavelet, only the largest detail coefficients should be included in the wavelet 
estimator. Hence, when estimating an unknown function, it makes sense to include only those coefficients 
that are larger than some specified threshold value t: 



Vh djk,t = djkl 



(7.11) 



{\d jk \>t} 



This 'keep-or-kill' operation is called hard thresholding, see Figure 7.1(a). 

Now, since each empirical coefficient consists of both a signal part and a noise part, it may be desirable 
to shrink even the coefficients that are larger than the threshold: 



djk'-= Vs djk,t 



sign djk 




(7.12) 



Since the function r/s is continuous in its first argument, this procedure is called soft thresholding. More 
complex thresholding schemes have been proposed in the literature [3], [8], [26]. They often appear as a 
compromise between soft and hard thresholding, see Figure 7.1(b) for an example. 




Figure 7.1: In (a) the hard thresholding is represented in plain line: a coefficient djk with an absolute 
value below t is put equal to zero. The soft thresholding is given in dashed line: there coefficients 
with absolute value above the threshold t are shrunk of an amount equal to i. In (b), a more complex 
thresholding procedure, the SCAD threshold devised in Antoniadis and Fan [3] is represented. 



For a given threshold value t and a thresholding scheme rjt), the nonlinear wavelet estimator is given by 
m ( x ) = ^2 s jok <Pj„k{x) + ^2v(.)[djk,t\ tpjk(x) , (7.13) 

k j,k V / 

where jo denotes the primary resolution level. It indicates the level above which the detail coefficients 
are being manipulated. 
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Let now dj = {djk,k = O,...^- 7 — 1} denote the vector of empirical detail coefficients at level j and 
similarly define Sj. In practice a nonlinear wavelet estimator is obtained in three steps. 

1. Apply the analyzing (forward) wavelet transform on the observations {1^}™ =1 , yielding Sj and dj, for 
j = jo,..., J- 1. 

2. Manipulate the detail coefficients above the level jo, e.g. by soft-thresholding them. 

3. Invert the wavelet transform and produce an estimation of m at the design points: {m (#j)}" =1 . 

If necessary, a continuous estimator m can then be constructed by an appropriate interpolation of the 

estimated m [xj) values [23]. 

The choice of the primary resolution level in nonlinear wavelet estimation has the same importance as 
the choice of a particular kernel in local polynomial estimation, i.e., it is not the most important factor. It 
is common practice to take jo = 2 or jo = 3, although a cross-validation determination is of course possible 
[39]. 

The selection of a threshold value is much more crucial. If it is chosen too large, the thresholding operation 
will kill too many coefficients. Too few coefficients will then be included in the reconstruction, resulting in 
an oversmoothed estimator. Conversely, a small threshold value will allow many coefficients to be included 
in the reconstruction, giving a rough, or undersmoothed estimator. A proper choice of the threshold involves 
thus a careful balance between smoothness and closeness of fit. 

In case of an orthogonal transform and i.i.d. white noise, the same threshold can be applied to all detail 
coefficients, since the errors in the wavelet domain are still i.i.d. white noise. However, if the errors are 
stationary but correlated, or if the transform is biorthogonal, a level-dependent threshold is necessary to 
obtain optimal results [33], [7]. Finally, in the irregular setting, a level and location dependent threshold 
must be utilized. 

Many efforts have been devoted to propose methods for selecting the threshold. We now review some of 
the procedures encountered in the literature. 

7.2.2 Choice of the threshold 

7.2.2.1 Universal threshold 

The most simple method to find a threshold whose value is supported by some statistical arguments, is 
probably to use the so-called 'universal threshold' [23], [24] 

^univ = cr d ^2logn , (7.14) 

where the only quantity to be estimated is a\, which constitutes the variance of the empirical wavelet 
coefficients. In case of an orthogonal transform, ad = a e /y/n. 

In a wavelet transform, the detail coefficients at fine scales are, with a small fraction of exception, 
essentially pure noise. This is the reason why Donoho and Johnstone proposed in [25] to estimate ad in a 

robust way using the median absolute deviation from the median (MAD) of dj-i- 

dj-i - median dj-i 



median 

°d = 



0.6745 " • (? - 15) 

If the universal threshold is used in conjunction with soft thresholding, the resulting estimator possesses a 

noise-free property: with a high probability, an appropriate interpolation of {m (xi)} produces an estimator 
which is at least as smooth as the function m, see Theorem 1.1 in [23]. Hence the reconstruction is of good 
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visual quality, so that Donoho and Johnstone called the procedure 'VisuShrink' [25]. Although simple, this 
estimator enjoys a near-minimax adaptivity property, see "Adaptivity of wavelet estimator" (Section 7.4: 
Adaptivity of wavelet estimator). However, this near-optimality is an asymptotic one: for small sample size 
^univ may be too large, leading to a poor mean square error. 

7.2.2.2 Oracle inequality 

Consider the soft-thresholded detail coefficients d ■ Another approach to find an optimal threshold is to look 
at the l 2 — risk 

Tl(d,dj =EY,(d jk -d jk \ =E\\d-d\\l, (7.16) 

and to relate this risk with the one of an ideal risk 7?-ideai- The ideal risk is the risk obtained if an oracle 
tells us exactly which coefficients to keep or to kill. 

In [24], Donoho and Johnstone showed that, when using the universal threshold, the following oracle 
inequality prevails 

n(d,d\<{2logn+l)(^ + TZ idea ,ij . (7.17) 

However, this inequality is not optimal. Donoho and Johnstone looked for the optimal threshold t* (n) 
which leads to the smallest possible constant A^ in place of 2logn + 1. Such a threshold does not exist in 
closed form, but can be approximated numerically. For small sample size, it is sensibly smaller than the 
universal threshold. 

7.2.2.3 SureShrink procedure 

Given the expression (7.16) for the ^-risk, it is natural to look for a threshold that minimizes an estimation 
of this risk. 

By minimizing Stein's unbiased estimate of the risk [42] and using a soft thresholding scheme, the resulting 
estimator, called 'SureShrink', is adaptive over a wide range of function spaces including Holder, Sobolev, 
and Besov spaces, see "Adaptivity of wavelet estimator" (Section 7.4: Adaptivity of wavelet estimator). That 
is, without any a priori knowledge on the type or amount of regularity of the function, the SURE procedure 
nevertheless achieves the optimal rate of convergence that one could attain by knowing the regularity of the 
function. 

7.2.2.4 Other thresholding procedures 

We mention some of the other thresholding or shrinkage procedures proposed in the literature. 

Instead of considering each coefficient individually, Cai et al. [9], [11] consider blocks of empirical wavelet 
coefficients in order to make simultaneous shrinkage decisions about all coefficients within a block. 

Another fruitful idea is to use the Bayesian framework. There a prior distribution is imposed on the 
wavelet coefficients dj k - This prior model is designed to capture the sparseness of the wavelet expansion. 
Next, the function is estimated by applying some Bayes rules on the resulting posterior distribution of the 
wavelet coefficients, see for example [2], [5], [34], [35]. 

Antoniadis and Fan [3] treat the problem of selecting the wavelet coefficients as a penalized least squares 
issue. Let W be the matrix of an orthogonal wavelet transform and Y := {^}™ =1 . The detail coefficients 
d := {rfjfc} which minimize 

\\WY-d\\l+J2p^\djk\) (7.18) 
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are used to estimate the true wavelet coefficients. In equation (7.18), p\ (•) is a penalty function which 
depends on the regularization parameter A. The authors provide a general framework, where different 
penalty functions correspond to different type of thresholding procedures (like, e.g., the soft- and hard- 
thresholding) and obtain oracle inequalities for a large class of penalty functions. 

Other methods include threshold selection by hypothesis testing [1], cross-validation [38], or generalized 
cross-validation [30], [32], which is used to estimated the ^-risk of the empirical detail coefficients. 

7.3 Linear versus nonlinear wavelet estimator 

In order to differentiate the behaviours of a linear and of a nonlinear wavelet estimator, we consider the 
Sobolev class W s (C) defined as 

W s q (C) = {/ : ||/||9 + || £^f (x) ||« < C 2 } , (7.19) 

and that we denote V in short. Assume we know that m, the function to be estimated, belongs to V. In 
the next section, we will release this assumption. The L p — risk of an arbitrary estimator T n based on the 
sample data is defined as i?||T n — m||L 1 < p < oo, whereas the L p — minimax risk is given by 

R n (V,p) = infsupE\\T n -m\\P (7.20) 

T„ m£V 

where the infimum is taken over all measurable estimators T n of m. Similarly, we define the linear 
L p — minimax risk as 

R l ™ (V,p) = infsupE\\T^ - m||P, (7.21) 

TH n m£V 

where the infimum is now taken over all linear estimators T^ n . Obviously, R l ™ (V,p) > R n (V,p) . We first 
state some definitions. 

Definition: The sequences {a n } and {&„} are said to be asymptotically equivalent and are 
noted a n ~ b n if the ratio a n /b n is bounded away from zero and oo as n — > oo. 

Definition: The sequence a n is called optimal rate of convergence , (or minimax rate of con- 
vergence) on the class V for the L p — risk if a n ~ R n (V,p) . We say that an estimator m n of m 
attains the optimal rate of convergence if sup E\\m n — m||| ~ R n (V,p) ■ 

m£V 

In order to fix the idea, we consider only the L^— risk in the remaining part of this section, thus p := 2. 

In [29], [43], the authors found that the optimal rate of convergence attainable by an estimator when 
the underlying function belongs to the Sobolev class W£ is a n = n 2 ^ 1 , hence R n (V,2) = n^+i. We 
saw in "Linear smoothing with wavelets" (Section 7.1: Linear smoothing with wavelets) that linear wavelet 
estimators attain the optimal rate for s— Holder function in case of the L 2 — risk (also called TMSE'). For 
a Sobolev class W* , the same result holds provided that q > 2. More precisely, we have the two following 
situations. 

1. If q > 2, we are in the so-called homogeneous zone. In this zone of spatial homogeneity, linear 
estimators can attain the optimal rate of convergence rr s ^ 2s+1 \ 

2. If q < 2, we are in the non-homogeneous zone, where linear estimators do not attain the optimal 
rate of convergence. Instead, we have: 

i^ n (V, 2) /R n {V, 2) -» oo, asm oo. (7.22) 
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The second result is due to the spatial variability of functions in Sobolev spaces with small index q. Linear 
estimators are based on the idea of spatial homogeneity of the function and hence do perform poorly in 
the presence of non-homogeneous functions. In contrast, even if q < 2, the SureShrink estimator attains 
the minimax rate [25]. The same type of results holds for more general Besov spaces, see for example [28], 
Chapter 10. 

7.4 Adaptivity of wavelet estimator 

We just saw that a nonlinear wavelet estimator is able to estimate in an optimal way functions of inhomo- 
geneous regularity. However, it may not be sufficient to know that for m belonging to a given space, the 
estimator performs well. Indeed, in general we do not know which space the function belongs to. Hence it 
is of great interest to consider a scale of function classes and to look for an estimator that attains simul- 
taneously the best rates of convergence across the whole scale. For example, the L q — Sobolev scale is a 
set of Sobolev function classes W* (C) indexed by the parameters s and C, see (7.19) for the definition of a 
Sobolev class. We now formalize the notion of an adaptive estimator. 

Let A be a given set and let {!F a , a 6 A} be the scale of functional classes T a indexed by a G A. Denote 
R n (a,p) the minimax risk over T a for the L p — loss: 

R n {a 7 p) = inf sup E\\m n - m\\ p p . (7.23) 



Definition: The estimator m* is called rate adaptive for the L p — loss and the scale of classes 
T a , a € A if for any a £ A there exists c a > such that 

sup E\\m* n - m\\P < c a R n (a,p) Vn > 1. (7.24) 

The estimator m* n is called adaptive up to a logarithmic factor for the L p — loss and the scale of classes 
T a , a € A if for any a € A there exist c a > and 7 = j a > such that 

sup E\\m* n - m\\P < c a {lognf R n (a,p) Vn > 1. (7.25) 

Thus, adaptive estimators have an optimal rate of convergence and behave as if they know in advance in 
which class the function to be estimated lies. 

The VisuShrink procedure is adaptive up to a logarithmic factor for the L^— loss over every Besov, Holder 
and Sobolev class that is contained in C [0, 1], see Theorem 1.2 in [23]. The SureShrink estimator does better: 
it is adaptive for the L^— loss, for a large scale of Besov, Holder and Sobolev classes, see Theorem 1 in [25]. 

7.5 Conclusion 

In this chapter, we saw the basic properties of standard wavelet theory and explained how these are related 
to the construction of wavelet regression estimators. 
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